Galaxy Morphologies

Lab Assignment Two: Exploring Image Data


Richmond Aisabor

Business Understanding

To understand how galaxies formed, researchers need to be able to classify them according to their shape. This dataset is a sample of the roughly one hundred billion galaxies in the observable universe. Each of these galaxies, containing billions of stars, has had a unique life and has interacted with its environment in a different way. Studying the morphology of galaxies helps researchers approach larger questions, such as how the conditions that allowed humans to exist came about. To relate the different shapes of galaxies to the physical phenomena that created them, images of galaxies must first be classified. The overall goal is to understand the natural processes that created the universe by analyzing the formation of galaxies.

To be confident that the algorithm is learning properly, it must at minimum outperform random guessing, which for 10 classes is about 10% accuracy under uniform guessing; this report sets a stricter floor of 50%. The user who benefits most from a successful algorithm is an astronomer. To be considered successful, the algorithm must reach an accuracy of at least 90%, and the ideal is to be as close to 100% as possible.

The dataset is the Galaxy10 dataset. It contains 21,785 color images of galaxies, each 69x69 pixels, labeled with one of 10 distinct classes. The images come from the Sloan Digital Sky Survey (SDSS), and the classification labels come from Galaxy Zoo.

Dataset source: https://astronn.readthedocs.io/en/latest/galaxy10.html

Data Preparation

Data Reduction

Principal Component Analysis

After running principal component analysis (PCA), only 500 principal components are needed to account for 95% of the variance between the images. The explained variance plot from a PCA run with 2000 components shows that 2000 principal components account for 99% of the variance. Beyond 500 components, each additional component contributes little new information and mostly adds redundancy to the analysis. Therefore, 500 principal components are sufficient to adequately represent the images.
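The component count for a given variance target can be read off the cumulative explained variance ratio. The following is a minimal sketch of that calculation, not the code used in this lab: the helper name `components_for_variance` is hypothetical, and the data is a small synthetic low-rank matrix standing in for the flattened Galaxy10 images.

```python
import numpy as np
from sklearn.decomposition import PCA


def components_for_variance(X, threshold=0.95):
    """Return (k, cumulative), where k is the smallest number of principal
    components whose cumulative explained variance ratio reaches threshold.

    Hypothetical helper for illustration only.
    """
    pca = PCA(svd_solver="full").fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # First index where the cumulative ratio is >= threshold.
    idx = int(np.searchsorted(cumulative, threshold))
    if idx >= len(cumulative):
        raise ValueError("threshold is not reached by the available components")
    return idx + 1, cumulative


# Synthetic stand-in for flattened images (assumption): rank-5 signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 50))
X += 0.01 * rng.normal(size=X.shape)

k, cumulative = components_for_variance(X, threshold=0.95)
print(f"{k} components explain {cumulative[k - 1]:.3f} of the variance")
```

On the real dataset, X would be the image array reshaped to one row per image, and the same function could be called with a threshold of 0.95 or 0.99 to check the component counts quoted above.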

Randomized PCA

PCA versus Randomized PCA

Either PCA or randomized PCA is acceptable for this dataset, because the two methods took the same amount of time to run here. Run time grows with the number of components, and since both methods are run with 500 principal components, full PCA runs as efficiently as randomized PCA in this case. Given the equal run times, it is safe to conclude that either method can be used to represent the images.
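A timing comparison of this kind can be sketched as follows. This is an illustration under assumptions rather than the lab's original code: the helper `time_pca` is hypothetical, the data is synthetic, and the component count is kept small so the example runs quickly. Measured times depend on the machine and the data shape, so no particular result is implied.

```python
import time

import numpy as np
from sklearn.decomposition import PCA


def time_pca(X, n_components, svd_solver):
    """Fit and transform X with the given solver.

    Returns (projected data, total explained variance ratio, seconds elapsed).
    Hypothetical helper for illustration only.
    """
    pca = PCA(n_components=n_components, svd_solver=svd_solver, random_state=0)
    start = time.perf_counter()
    Z = pca.fit_transform(X)
    elapsed = time.perf_counter() - start
    return Z, float(pca.explained_variance_ratio_.sum()), elapsed


# Synthetic stand-in for flattened images (assumption): rank-20 signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20)) @ rng.normal(size=(20, 400))
X += 0.01 * rng.normal(size=X.shape)

Z_full, var_full, t_full = time_pca(X, n_components=20, svd_solver="full")
Z_rand, var_rand, t_rand = time_pca(X, n_components=20, svd_solver="randomized")

print(f"full PCA:       {t_full:.4f} s, variance explained {var_full:.4f}")
print(f"randomized PCA: {t_rand:.4f} s, variance explained {var_rand:.4f}")
```

Comparing the explained variance alongside the run time helps confirm that the two solvers produce comparable representations, not just comparable speeds.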

PCA for Image Classification

Feature Extraction

DAISY Bag of Features Model

Performance

After finding the accuracy scores for each feature extraction method, it is apparent that PCA feature extraction is the most accurate at classifying images in the galaxy dataset. The PCA representation uses 500 principal components, while the DAISY representation contains only 272 features. The higher accuracy of PCA is likely a result of its larger feature space.
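One way to probe whether feature-space size drives accuracy is to hold the classifier fixed and vary the number of PCA components. The sketch below does this under several assumptions that are not part of this lab: it uses scikit-learn's bundled digits images as a stand-in for Galaxy10, a 1-nearest-neighbor classifier, and a hypothetical helper named `accuracy_by_components`.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def accuracy_by_components(X, y, component_counts, seed=0):
    """Return {k: test accuracy} for a 1-NN classifier on k PCA features.

    Hypothetical helper for illustration only.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    scores = {}
    for k in component_counts:
        # Fit PCA on the training split only to avoid leaking test data.
        pca = PCA(n_components=k, random_state=seed).fit(X_train)
        clf = KNeighborsClassifier(n_neighbors=1)
        clf.fit(pca.transform(X_train), y_train)
        scores[k] = clf.score(pca.transform(X_test), y_test)
    return scores


# Stand-in dataset (assumption): 8x8 digit images bundled with scikit-learn.
digits = load_digits()
scores = accuracy_by_components(digits.data, digits.target, [2, 10, 30])
for k, acc in scores.items():
    print(f"{k:>2} components: accuracy {acc:.3f}")
```

Running the same comparison on the galaxy images with, for example, 272 and 500 PCA components would indicate how much of the gap between PCA and DAISY is attributable to feature count rather than to the kind of feature.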

Exceptional Work